In this project, I use the Red Wine Quality dataset to perform an exploratory data analysis in R and find out what influences the quality of red wines.
First, we’ll have a look at the data.
## [1] 1599 13
## Observations: 1,599
## Variables: 13
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ fixed.acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, ...
## $ volatile.acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,...
## $ citric.acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,...
## $ residual.sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2...
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,...
## $ free.sulfur.dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15...
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, ...
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0...
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,...
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,...
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, ...
## $ quality <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5,...
There are 1599 observations and 13 variables. The X variable is an index for each observation in the dataset, while the other variables are the chemical properties and the quality rating of each red wine sample.
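The listing above looks like the output of `dim()` followed by a structure overview. As a minimal base-R sketch of the same inspection, the block below rebuilds only the first five rows from the values printed above; the data frame name `red_wine_head` and the file name in the comment are assumptions for illustration, not taken from this report.

```r
# First five rows, copied from the values printed above (illustration only).
# With the full file one would load it instead, for example:
#   red_wine <- read.csv("wineQualityReds.csv")   # file name is an assumption
red_wine_head <- data.frame(
  X = 1:5,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70),
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9),
  chlorides = c(0.076, 0.098, 0.092, 0.075, 0.076),
  free.sulfur.dioxide = c(11, 25, 15, 17, 11),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34),
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978),
  pH = c(3.51, 3.20, 3.26, 3.16, 3.51),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L)
)

shape <- dim(red_wine_head)   # number of rows, number of columns
print(shape)
str(red_wine_head)            # type and first values of every column
```

On the full dataset the same two calls would report 1599 rows and 13 columns, matching the listing above.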
Let’s look at the distribution of the variables:
## Using as id variables
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Most of the histograms above look roughly normal, except for residual sugar and chlorides, which are clearly right skewed. Outliers can cause that kind of skewness, and the boxplots below show that these two variables have many outliers.
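The `stat_bin()` message above recommends choosing a bin width instead of relying on the default of 30 bins. As a rough base-R sketch of that idea (not the plotting code used for this report), the block below bins the first eight residual sugar values from the data preview with explicit edges of width 0.5; the edge values are an arbitrary choice for illustration.

```r
# First eight residual sugar values from the data preview (illustration only).
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2)

# Explicit bin edges of width 0.5 instead of a default bin count.
bin_edges <- seq(1, 3, by = 0.5)
h <- hist(residual_sugar, breaks = bin_edges, plot = FALSE)

# One row per bin: lower edge, upper edge, number of samples in the bin.
bins <- data.frame(
  from = head(h$breaks, -1),
  to = tail(h$breaks, -1),
  count = h$counts
)
print(bins)
```

With the full data, a narrower width would make the long right tail of residual sugar and chlorides easier to see than the default binning does.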
To visualize the variability of the variables, we can use a boxplot for each one:
From the visualizations above, we can see the minimum and maximum values of each variable along with the median and outliers.
Let’s look closer into them:
Residual Sugar:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual sugar has many outliers. Most values fall between about 1 and 4 (the minimum is 0.9); samples with more than that are outliers.
Alcohol:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Alcohol content has few outliers. Half of the samples fall between 9.5 and 11.1 (the first and third quartiles), and about a quarter contain more than 11.1.
Sulphates:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The mean value of sulphates is 0.66; values higher than 1 are considered outliers.
Fixed Acidity:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The fixed acidity has a mean of 8.32, and most samples have fixed acidity values between 7 and 9. Samples that have values higher than 12 are extreme outliers.
Volatile Acidity:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Volatile acidity is approximately normally distributed. The mean value is 0.53 and there are only a few outliers.
Citric Acid:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Citric acid values lie between 0 and 0.75, with a single outlier at 1. The mean is 0.27.
Chlorides:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
If we ignore the outliers, the chlorides values would be approximately normally distributed. Half of the values fall in the narrow range between 0.07 and 0.09, and the mean is 0.087.
Free Sulfur Dioxide:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The free sulfur dioxide distribution appears to be right skewed. Half of the values are between 7 and 21. Values higher than 40 are extreme outliers.
Total Sulfur Dioxide:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Total sulfur dioxide is also right skewed. The mean value is 46 and there are extreme outliers that have values greater than 100.
Density:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The density is normally distributed with a mean value of 0.997.
pH:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
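The pH distribution is also approximately normal. The first and third quartiles are 3.2 and 3.4 respectively, with a mean of 3.3, so about half of the samples fall within that range. Only values well outside it (more than 1.5 times the interquartile range beyond the quartiles, the usual boxplot rule) are outliers.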
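The outlier cut-offs quoted in the summaries above can be checked against the usual boxplot rule, which flags points more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the quartiles printed above; it assumes the boxplots use this rule, and the column names are chosen here for illustration.

```r
# Quartiles copied from the summaries printed above.
quartiles <- data.frame(
  variable = c("residual.sugar", "sulphates", "fixed.acidity",
               "free.sulfur.dioxide", "total.sulfur.dioxide", "pH"),
  q1 = c(1.90, 0.55, 7.10, 7.00, 22.00, 3.21),
  q3 = c(2.60, 0.73, 9.20, 21.00, 62.00, 3.40)
)

# Tukey's rule: points beyond 1.5 * IQR from the quartiles are flagged.
iqr <- quartiles$q3 - quartiles$q1
quartiles$lower_fence <- quartiles$q1 - 1.5 * iqr
quartiles$upper_fence <- quartiles$q3 + 1.5 * iqr

print(quartiles)
```

The upper fences come out at 3.65, 1.00, 12.35, 42, 122 and 3.685, which is close to the thresholds of roughly 4, 1, 12, 40 and 100 used in the text. Boxplot hinges can differ slightly from the quartiles reported by `summary()`, so the plotted whiskers may not match these numbers exactly.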
There are strong correlations between some of the variables in the dataset:
From this scatter plot, it is clear that there is a strong negative correlation between pH and fixed acidity.
There is a strong positive correlation between fixed acidity and density.
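To make the term concrete, the sketch below computes Pearson's correlation coefficient by hand and compares it with R's built-in `cor()`. It uses only the first five pH and fixed acidity values from the data preview, which is far too few points to reproduce the coefficient of the full dataset; the helper name `pearson_r` is hypothetical.

```r
# First five rows of fixed acidity and pH from the data preview
# (illustration only, not a reliable estimate).
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4)
ph <- c(3.51, 3.20, 3.26, 3.16, 3.51)

# Pearson's r: sum of cross-deviations scaled by both spreads.
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual <- pearson_r(fixed_acidity, ph)
r_builtin <- cor(fixed_acidity, ph)  # method = "pearson" is the default

print(c(manual = r_manual, builtin = r_builtin))
```

Even on this tiny sample the sign is negative, in line with the scatter plot: more fixed acidity goes with lower pH.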
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
There is a positive correlation between quality and sulphates. Because most samples have a quality rating of 5 or 6, the overall mean of sulphates (about 0.66) mostly reflects those samples.
The dataset contains 1599 observations of red wine samples. Apart from the index X, it has 12 variables: 11 that describe the chemical properties of each sample, plus its quality rating.
The aim of this data exploration and analysis is to see what could affect the quality of wine. So the main feature of interest in this dataset is the quality.
All the chemical properties will help support the investigation, since any of them could plausibly affect the quality.
No new variables were created.
There were no unusual distributions. The dataset is tidy and complete, so no changes or adjustments were needed.
We can visualize the relationship between the variables and their correlations in the matrix below:
Since we’re interested in the quality of the wine sample, we will focus on the highest correlation coefficients with quality. The top two are alcohol and sulphates with 0.48 and 0.25 coefficients respectively.
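Ranking the variables by their correlation with quality can also be done directly, without reading values off the matrix. The sketch below shows one way to do it; the function name `rank_by_correlation` is hypothetical, and the six sample rows come from the data preview, so the resulting numbers will not match the 0.48 and 0.25 obtained on the full dataset.

```r
# First six rows of a few columns from the data preview (illustration only).
sample_wine <- data.frame(
  X = 1:6,
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# Correlate every predictor with the target and sort by absolute strength.
# `drop` lists columns to ignore, such as a row index.
rank_by_correlation <- function(df, target, drop = character()) {
  predictors <- setdiff(names(df), c(target, drop))
  r <- sapply(predictors, function(v) cor(df[[v]], df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

ranked <- rank_by_correlation(sample_wine, target = "quality", drop = "X")
print(round(ranked, 2))
```

Sorting by absolute value keeps strong negative relationships, such as the one with volatile acidity, near the top of the list instead of at the bottom.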
Let’s look into how quality is affected by alcohol:
## red_wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## red_wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## red_wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## red_wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## red_wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## red_wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
Higher-rated samples tend to contain more alcohol: the mean rises from 9.9 at quality 5 to 10.63, 11.47 and 12.09 at qualities 6, 7 and 8. The ratings 3 to 5 show no clear trend.
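The grouped summaries above have the layout that base R's `by()` prints. A minimal sketch of that kind of call, using only the first eight alcohol and quality values from the data preview (the data frame name `sample_wine` is an assumption):

```r
# First eight alcohol and quality values from the data preview
# (illustration only; the full dataset has 1599 rows).
sample_wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L)
)

# Min, quartiles, median and mean of alcohol for each quality rating.
print(by(sample_wine$alcohol, sample_wine$quality, summary))

# Only the group means, keyed by rating.
mean_alcohol <- tapply(sample_wine$alcohol, sample_wine$quality, mean)
print(mean_alcohol)
```

`tapply()` is convenient when only one statistic per group is needed, for example to plot mean alcohol against quality.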
From the previous observations, alcohol content has the strongest relationship with quality. Other features, such as sulphates and citric acid, also appear to be related to quality, though less strongly.
I assumed that the more alcohol a sample contains, the more sugar it would contain. Interestingly, this is not the case in this dataset: there is no clear relationship between alcohol content and residual sugar.
## round(red_wine$alcohol): 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.600 1.700 1.800 1.833 1.950 2.100
## --------------------------------------------------------
## round(red_wine$alcohol): 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.30 1.90 2.10 2.72 2.60 15.50
## --------------------------------------------------------
## round(red_wine$alcohol): 10
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.900 2.100 2.406 2.600 13.900
## --------------------------------------------------------
## round(red_wine$alcohol): 11
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 2.000 2.200 2.507 2.600 9.000
## --------------------------------------------------------
## round(red_wine$alcohol): 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 2.000 2.300 2.728 2.800 12.900
## --------------------------------------------------------
## round(red_wine$alcohol): 13
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 2.100 2.400 2.739 3.050 6.400
## --------------------------------------------------------
## round(red_wine$alcohol): 14
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.600 1.800 1.800 2.131 2.200 4.300
## --------------------------------------------------------
## round(red_wine$alcohol): 15
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.5 7.5 7.5 7.5 7.5 7.5
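The summary above shows that mean residual sugar is somewhat higher when alcohol rounds to 9 through 13 (about 2.4 to 2.7) than when it rounds to 14 (about 2.1). In the group at 15, the minimum and maximum are both 7.5, so that group cannot support any conclusion.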
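The group labelled 15 may look surprising, since the maximum alcohol value reported earlier is 14.9. It follows from grouping with `round()`, as the sketch below illustrates; the two exact halves are added here only to show how R treats them.

```r
# The minimum (8.4) and maximum (14.9) alcohol values from the summary above,
# plus two exact halves chosen for illustration.
alcohol <- c(8.4, 9.5, 10.5, 14.9)

# round() goes to the nearest integer, so 14.9 lands in group 15;
# exact halves follow R's round-half-to-even rule.
rounded <- round(alcohol)

# floor() groups by the integer part instead, so 14.9 stays in group 14.
floored <- floor(alcohol)

print(data.frame(alcohol, rounded, floored))
```

Grouping with `floor()` (or with `cut()` and explicit breaks) would keep every group inside the observed range and avoid a group that depends on the highest values alone.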
The relationship between alcohol content and quality rating.
Let’s see how alcohol, volatile acidity and citric acid are related to red wine quality:
It appears that high alcohol and citric acid combined with low volatile acidity are associated with a high quality rating in red wines.
Now let’s look closer at how volatile acidity relates to quality:
The best quality wine samples are those that have high alcohol content and low volatile acidity.
The samples that have high alcohol content along with high citric acid are the best quality wines.
High-quality red wines tend to contain lower residual sugar levels.
In this red wine samples dataset, the lowest quality rating is 3 and the highest is 8. The quality ratings are roughly normally distributed: most samples are rated 5 or 6, and only a few are rated lower or higher.
The graph above shows that high-quality red wine (green) sits on the left side, where volatile acidity is low. The density curves for quality ratings 5, 7 and 8 peak above 3, whereas the rest peak between 1 and 2.5.
As shown in the graph above, the higher-quality samples (darker spots) appear in the upper right part of the graph, where citric acid and alcohol content are both higher.
The most challenging part in this project was that I had no background knowledge about wines and their quality since I come from a country where we don’t drink alcoholic beverages. I chose this dataset to enrich my knowledge and learn more about how wines are considered high quality. Since this topic is new to me, I found everything in this exploratory data analysis to be very interesting.
In this red wine dataset, the best-rated wines tend to have high alcohol, citric acid and sulphates (positive correlations) and low volatile acidity (negative correlation). The dataset has no samples rated below 3 or above 8, so a larger sample that covers the whole rating range would further improve the analysis. Predictive models could be built to predict wine quality and test these trends.
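As a first, deliberately simple sketch of such a model, the block below fits a linear regression of quality on alcohol with `lm()`. It is not part of the analysis above: it uses only the first eight rows from the data preview, and it treats the ordinal quality rating as a plain number, which is a simplifying assumption.

```r
# First eight alcohol and quality values from the data preview
# (illustration only; far too few rows for a trustworthy fit).
sample_wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L)
)

# Ordinary least squares: quality as a linear function of alcohol.
fit <- lm(quality ~ alcohol, data = sample_wine)
slope <- coef(fit)[["alcohol"]]

print(coef(fit))

# Predicted rating for a hypothetical sample with alcohol 9.6.
predicted <- predict(fit, newdata = data.frame(alcohol = 9.6))
print(predicted)
```

On the full dataset, further predictors such as volatile acidity, sulphates and citric acid could be added to the formula, and a held-out subset could be used to check how well the trends generalize.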